This week I learned about clustering and classification: how to cluster observations, how to study which factors affect or justify the clustering, and how to decide how many clusters are appropriate.
This week’s data comes from an R package called MASS:
# access the MASS package
library(MASS)
## Warning: package 'MASS' was built under R version 3.4.4
# load the data
data("Boston")
# explore the dataset
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
The dataset consists of 506 observations of 14 variables. All variables are numerical; one of them (‘chas’) is a 1/0 dummy variable for presence/absence. The variables describe housing values in the suburbs of Boston and factors measured in those suburbs that are thought to be related to housing values. These include, for example, the per capita crime rate, whether the tract bounds the Charles River, the nitrogen oxides concentration, the average number of rooms per dwelling, the weighted mean of distances to five Boston employment centres, an index of accessibility to radial highways, a measure based on the proportion of Black residents by town, and the median value of owner-occupied homes. The full details can be found here.
Let’s explore graphically the distributions and relations of the data:
pairs(Boston)
This plot is difficult to read with all 14 variables in one grid. I’ll figure out later how to improve the quality of the output. The summary table above, however, already shows the range and spread of each variable.
Here are two dot plots of the variables ‘crim’ (per capita crime rate by town) and ‘zn’ (proportion of residential land zoned for lots over 25,000 sq.ft.). Both are strongly skewed: most towns have values close to zero, and a few have very large values:
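One possible way to make the scatterplot matrix more legible is to plot only a handful of variables with smaller, semi-transparent points. This is a minimal sketch; the choice of the five variables in `vars` is an assumption made purely for illustration.

```r
library(MASS)   # provides the Boston data
data("Boston")

# Hypothetical subset: five variables picked only for illustration
vars <- c("crim", "nox", "rm", "lstat", "medv")

# Small, semi-transparent points keep the individual panels readable
pairs(Boston[, vars], pch = 20, cex = 0.4, col = rgb(0, 0, 0, alpha = 0.3))
```

Fewer panels leave more room for each pair, and the transparency shows where the observations pile up.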
dotchart(Boston$crim)
dotchart(Boston$zn)
Some variables seem correlated. Here’s a correlation matrix of the variables:
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.4.4
## corrplot 0.84 loaded
library(magrittr)
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) %>%round(digits=2)
# print the correlation matrix
print(cor_matrix)
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
The above matrix is not very readable as it extends into two separate parts. Let’s present the correlations in a nicer way.
# visualize the correlation matrix
corrplot(cor_matrix, method="circle",type="upper",cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
This plot is easier to read. The bigger the circle, the stronger the correlation. Red indicates a negative correlation and blue a positive one.
The variables are measured on very different scales (for example, ‘tax’ ranges from 187 to 711 while ‘nox’ stays below 1). I scale all variables, because otherwise it may be difficult to sum or average them, and distance-based methods such as k-means would be dominated by the variables with the largest numeric range. Scaling can be applied to every variable in the dataset, as they are all numerical.
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix"
# change the object to data frame
boston_scaled<-as.data.frame(boston_scaled)
Now all the variables have a mean of zero and a standard deviation of one, so they are on a comparable scale.
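To make the standardization concrete, here is a small check that `scale()` with its default arguments does the same as subtracting the mean and dividing by the standard deviation. The names `crim_manual` and `crim_scaled` are only illustrative.

```r
library(MASS)
data("Boston")

# Standardize one column by hand: subtract the mean, divide by the standard deviation
crim_manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)

# scale() returns a matrix; pick the same column for comparison
crim_scaled <- scale(Boston)[, "crim"]

# TRUE if the two versions agree up to numerical tolerance
all.equal(crim_manual, unname(crim_scaled))
```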
Next I create a categorical variable of the crime rate in the Boston dataset. I use quantiles as the break points. I drop the old crime rate variable from the dataset.
# summary of the scaled crime rate
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
For later model evaluation purposes I divide the dataset into training and testing datasets, so that 80% of the data belongs to the train set:
# divide the data into training and testing sets
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
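Because the split is random, the train and test sets (and everything computed from them) change on every run. A minimal sketch of a reproducible version follows; the seed value 2018 is an arbitrary assumption, and for brevity the scaled data is rebuilt here without the categorical crime variable.

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

# Fix the random number generator so the same rows are drawn every time
set.seed(2018)   # arbitrary seed, chosen only for illustration

n <- nrow(boston_scaled)
ind <- sample(n, size = floor(n * 0.8))   # 80% of the row indices
train <- boston_scaled[ind, ]
test <- boston_scaled[-ind, ]

# number of rows in each part
c(train = nrow(train), test = nrow(test))
```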
Next I want to know which variables might explain the target variable crime rate. I do a linear discriminant analysis with the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables:
# linear discriminant analysis
lda.fit <- lda(crime~., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2376238 0.2425743 0.2475248 0.2722772
##
## Group means:
## zn indus chas nox rm
## low 1.0650461 -0.9351601 -0.14929469 -0.9401011 0.41978615
## med_low -0.1081275 -0.2400726 -0.03128211 -0.5476584 -0.10025447
## med_high -0.3740445 0.2163477 0.04263895 0.4022771 -0.04516597
## high -0.4872402 1.0169558 -0.05757815 1.0729609 -0.42275426
## age dis rad tax ptratio
## low -0.9389036 0.9761825 -0.6839876 -0.7250979 -0.5183503
## med_low -0.2596039 0.3421286 -0.5470944 -0.4702677 -0.0129253
## med_high 0.3690771 -0.3737602 -0.4156770 -0.2908389 -0.2418230
## high 0.8171042 -0.8609434 1.6397657 1.5152267 0.7826832
## black lstat medv
## low 0.37017907 -0.7589822 0.54925644
## med_low 0.34363031 -0.1155007 -0.02597870
## med_high 0.09247035 0.0630770 0.08243831
## high -0.79147372 0.9165597 -0.72859762
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.15901477 0.75113511 -0.87129239
## indus -0.01619110 -0.25626598 0.33512380
## chas -0.03429512 0.02889843 0.21280671
## nox 0.47068316 -0.77866504 -1.48659496
## rm 0.03336940 -0.01915254 0.01632860
## age 0.27577809 -0.24110376 0.06974958
## dis -0.12433986 -0.24291762 0.22997863
## rad 3.22685213 0.91133980 0.13139621
## tax -0.05615886 0.05565457 0.34705891
## ptratio 0.16612135 -0.03764257 -0.18171510
## black -0.10867202 -0.01615726 0.14891413
## lstat 0.16485939 -0.16034170 0.28994651
## medv 0.05713960 -0.35019593 -0.35987259
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9519 0.0369 0.0112
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[,choices[1]],
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads),
       cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2,col=classes,pch=classes)
lda.arrows(lda.fit, myscale = 1)
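The variable ‘rad’ looks like the strongest classifying factor: its LD1 coefficient (3.23) is by far the largest, and LD1 accounts for about 95% of the between-group separation (proportion of trace 0.9519). ‘zn’ and ‘nox’ also divide the observations, mainly along LD2.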
Next I want to use the observations in the test set to predict crime classes. I do this because I want to estimate the “goodness” of my model by comparing predictions to observed “real” data.
For prediction I use the LDA model on the test data. For comparison I tabulate the results with the crime categories from the test set:
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 16 12 3 0
## med_low 6 17 5 0
## med_high 0 10 15 1
## high 0 0 0 17
I did the random division of train and test data and predicted the above classes twice. The first time I got a fairly poor result, with more than half of the med_high cases predicted incorrectly. On the second round the results look better (results shown here): 65 of the 102 test observations (about 64%) are on the diagonal of the table. All 17 ‘high’ cases are predicted correctly, while the three lower classes are still often confused with their neighbouring classes.
Next I study the Boston data without any predefined classes and try to cluster the observations into groups. Maybe the observations form clusters according to the suburbs. I run the k-means algorithm on the dataset, investigate the optimal number of clusters, and run the algorithm again.
First I reload the Boston dataset and standardize it. Then I calculate the Euclidean distances between the observations and present a summary of the distances:
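The cross tabulation can be summarized numerically. The sketch below copies the counts from the table printed above into a matrix (the object names `conf`, `accuracy` and `recall` are illustrative) and computes the overall accuracy and the share of each true class that was predicted correctly.

```r
# Confusion matrix copied from the output above (rows = correct, columns = predicted)
lvls <- c("low", "med_low", "med_high", "high")
conf <- matrix(c(16, 12,  3,  0,
                  6, 17,  5,  0,
                  0, 10, 15,  1,
                  0,  0,  0, 17),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct = lvls, predicted = lvls))

# Overall accuracy: correctly classified cases divided by all test cases
accuracy <- sum(diag(conf)) / sum(conf)

# Per-class recall: share of each true class that was predicted correctly
recall <- diag(conf) / rowSums(conf)

round(accuracy, 3)
round(recall, 3)
```

With a live model the same numbers could be computed directly from `table(correct = correct_classes, predicted = lda.pred$class)` instead of a hand-copied matrix.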
#standardize the data set
boston_scaled2 <- scale(Boston)
# class of the boston_scaled2 object
class(boston_scaled2)
## [1] "matrix"
# change the object to data frame
boston_scaled2<-as.data.frame(boston_scaled2)
# euclidean distance matrix
dist_eu <- dist(boston_scaled2)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
Next I run the k-means clustering with 3 centers.
# k-means clustering
km <-kmeans(boston_scaled2, centers = 3)
# plot the Boston dataset with clusters
pairs(boston_scaled2[9:14], col = km$cluster)
I zoomed in on various parts of the plot. In the panels involving the variable ‘tax’ the clusters are visibly separated, and at least the observations drawn in black (cluster 1) clearly form their own group.
I also explored the clustering with 5 centers. The grouping seemed even more arbitrary.
I’m not sure about the best number of clusters, so I calculate the total within-cluster sum of squares (WCSS) and see how it behaves when the number of clusters changes:
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.4
# set values
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
The total WCSS drops most sharply when moving from one to two clusters and decreases more gradually after that. By this "elbow" criterion, two is a reasonable number of clusters for this dataset.
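The size of the drop can also be read from the numbers rather than from the plot. The following sketch recomputes the total WCSS and prints how much it decreases with each added cluster. The argument `nstart = 20` is an assumption that is not used elsewhere in this diary; it reruns k-means from several random starts and keeps the best fit, which makes the curve less dependent on a single initialization.

```r
library(MASS)
data("Boston")
boston_scaled2 <- as.data.frame(scale(Boston))

set.seed(123)
k_max <- 10

# total within-cluster sum of squares for k = 1, ..., k_max
# (nstart = 20 is an illustrative choice, not part of the original analysis)
twcss <- sapply(1:k_max, function(k) {
  kmeans(boston_scaled2, centers = k, nstart = 20)$tot.withinss
})

# reduction in total WCSS gained by adding one more cluster
drop <- -diff(twcss)
names(drop) <- paste0(1:(k_max - 1), "->", 2:k_max)
round(drop)
```

A large first value followed by much smaller ones would support the choice of two clusters.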
I run the clustering again with 2 centers:
# k-means clustering
km <-kmeans(boston_scaled2, centers = 2)
# plot the Boston dataset with clusters
pairs(boston_scaled2[1:6], col = km$cluster)
Now the clustering seems better, at least for some variable pairs. But in my opinion, having only two groups doesn’t tell us much. Maybe it suggests that the residents of Boston are divided into two groups, the wealthy and the poor?
Next I perform the LDA again on the Boston dataset, this time with the k-means clusters (3 centers) as the target variable. By visualizing the results with a biplot I can interpret which variables influence the clustering.
boston_scaled3<-boston_scaled2
# k-means clustering
km <-kmeans(boston_scaled3, centers = 3)
klusteri<-km$cluster
class(klusteri)
## [1] "integer"
boston_scaled3<-cbind(boston_scaled3,klusteri)
summary(boston_scaled3)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv klusteri
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063 Min. :1.000
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989 1st Qu.:1.000
## Median : 0.3808 Median :-0.1811 Median :-0.1449 Median :2.000
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean :1.972
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683 3rd Qu.:3.000
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865 Max. :3.000
# linear discriminant analysis
lda.fit2 <- lda(klusteri~., data = boston_scaled3)
# print the lda.fit object
lda.fit2
## Call:
## lda(klusteri ~ ., data = boston_scaled3)
##
## Prior probabilities of groups:
## 1 2 3
## 0.3003953 0.4268775 0.2727273
##
## Group means:
## crim zn indus chas nox rm
## 1 0.8942488 -0.4872402 1.0913679 -0.01330932 1.1109351 -0.4609873
## 2 -0.3688324 -0.3935457 -0.1369208 0.07398993 -0.1662087 -0.1700456
## 3 -0.4076669 1.1526549 -0.9877755 -0.10115080 -0.9634859 0.7739125
## age dis rad tax ptratio black
## 1 0.7828949 -0.84882600 1.3656860 1.3895093 0.63256391 -0.7083974
## 2 0.1673019 -0.07766431 -0.5799077 -0.5409630 -0.04596655 0.2680397
## 3 -1.1241828 1.05650031 -0.5965522 -0.6837494 -0.62478941 0.3607235
## lstat medv
## 1 0.90799414 -0.69550394
## 2 -0.05818052 -0.04811607
## 3 -0.90904433 0.84137443
##
## Coefficients of linear discriminants:
## LD1 LD2
## crim 0.043702606 0.16161136
## zn 0.049248495 0.76920932
## indus -0.331498698 0.02870425
## chas -0.012406954 -0.11314905
## nox -0.721972554 0.40566595
## rm 0.174541989 0.41632858
## age 0.006221178 -0.88117192
## dis 0.043869924 0.36910493
## rad -1.256861546 0.47665247
## tax -0.992855786 0.46457291
## ptratio -0.092336951 -0.01003010
## black 0.073915653 -0.03513128
## lstat -0.372145848 0.38403679
## medv -0.058153798 0.49571753
##
## Proportion of trace:
## LD1 LD2
## 0.8785 0.1215
# target classes as numeric
classes <- as.numeric(boston_scaled3$klusteri)
# plot the lda results
plot(lda.fit2, dimen = 2,col=classes,pch=classes)
lda.arrows(lda.fit2, myscale = 1)
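From these results I would interpret that the variable ‘rad’ (index of accessibility to radial highways) is the strongest linear separator in this dataset, with the largest absolute LD1 coefficient (-1.26). Several other variables follow not far behind, notably ‘tax’ (-0.99) and ‘nox’ (-0.72) on LD1, and ‘age’ (-0.88) and ‘zn’ (0.77) on LD2.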
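Because all predictors are standardized, the sizes of the discriminant coefficients can be compared directly. A minimal sketch of ranking the variables by their absolute LD1 coefficient follows. It refits the model from scratch, and the fixed seed is an assumption, so the cluster numbering and the exact coefficients may differ from the output printed above.

```r
library(MASS)
data("Boston")
boston_scaled3 <- as.data.frame(scale(Boston))

set.seed(123)  # assumption: a fixed seed so that the clusters are reproducible
boston_scaled3$klusteri <- kmeans(boston_scaled3, centers = 3)$cluster

# LDA with the k-means cluster as the target variable
lda.fit2 <- lda(klusteri ~ ., data = boston_scaled3)

# variables ordered by the absolute size of their LD1 coefficient
ld1 <- coef(lda.fit2)[, "LD1"]
round(sort(abs(ld1), decreasing = TRUE), 2)
```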
Next I’ll draw some 3D plots of the training data. NOTE! You might have to click around to see the figure.
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
## Warning: package 'plotly' was built under R version 3.4.4
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
## Warning: package 'bindrcpp' was built under R version 3.4.4
# Set the color to be the crime classes of the train set. Draw another 3D plot where the color is defined by the clusters of the k-means. How do the plots differ?
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers',color=train$crime)
I stop the exercise here because I’m having trouble understanding the instructions. I’m able to draw these two 3D plots, and crime seems to be a strong separator in the dataset. The last plot should demonstrate the division by clusters. However, I’m no longer sure whether I should run the k-means clustering again on the training data and then change the plotting code, or whether I could do it just by modifying the color argument. I leave it here.
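One possible way to finish this step, offered as a sketch and not necessarily the intended solution: the color vector needs one value per plotted point, so k-means can be run on the same training rows that are projected onto the discriminants, and the resulting cluster labels passed to the color argument. The seed, the choice of 3 centers and the object name `km_train` are assumptions.

```r
library(MASS)
data("Boston")

# rebuild the scaled data with the categorical crime variable, as earlier
boston_scaled <- as.data.frame(scale(Boston))
bins <- quantile(boston_scaled$crim)
boston_scaled$crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL

set.seed(123)  # assumption: any fixed seed
ind <- sample(nrow(boston_scaled), size = floor(nrow(boston_scaled) * 0.8))
train <- boston_scaled[ind, ]

# LDA on the training data and projection of the predictors onto LD1-LD3
lda.fit <- lda(crime ~ ., data = train)
model_predictors <- train[, setdiff(names(train), "crime")]
matrix_product <- as.data.frame(as.matrix(model_predictors) %*% lda.fit$scaling)

# k-means on the same training rows: one cluster label per plotted point
km_train <- kmeans(model_predictors, centers = 3)

# draw the 3D plot only if plotly is installed
if (requireNamespace("plotly", quietly = TRUE)) {
  p <- plotly::plot_ly(x = matrix_product$LD1, y = matrix_product$LD2,
                       z = matrix_product$LD3, type = "scatter3d", mode = "markers",
                       color = factor(km_train$cluster))
  print(p)
}
```

Clusters taken from a k-means run on the full dataset would have 506 labels and would need to be subset with the training indices before they could be used as colors here.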